Instructions

As in the previous assignment, you'll be using PyTorch instead of EDF. This assignment focuses on generative modelling: you'll implement and train a VAE and a GAN.

We highly recommend using Google Colab and running the notebook on a GPU node. Check https://colab.research.google.com/ and look for tutorials online on how to use it. To use a GPU, go to Runtime -> Change runtime type and select GPU.

In [1]:
import torch, math, copy
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torchvision
from torchvision import datasets, transforms
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F
from scipy.stats import kde

We'll start by coding up a toy problem and seeing how a VAE and a GAN behave on it. Consider the following stochastic process: $$ \mu_x \sim U(\{1,2,3\})$$ $$ \mu_y \sim U(\{1,2,3\})$$ $$ s \sim \mathcal N \left([\mu_x, \mu_y], \frac{1}{100}I \right)$$ where $I$ is the $2 \times 2$ identity matrix.

Implement the function in the next cell such that it returns $n$ samples of $s$ from the above process. The returned object should be an $n \times 2$ PyTorch tensor.

In [ ]:
def sample(n):
    # Means: each coordinate drawn uniformly from {1, 2, 3}
    mean = torch.randint(1, 4, (n, 2), dtype=torch.float32)
    # Covariance is (1/100) I, so the per-coordinate std is sqrt(1/100) = 1/10
    std = torch.ones(n, 2) / 10
    s = torch.normal(mean, std)
    return s

Now we'll sample 1000 points and see how they are distributed.

In [ ]:
def plot_density(data):
    data = data.numpy()
    nbins = 50
    x, y = data.T
    k = kde.gaussian_kde(data.T)
    xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
    zi = k(np.vstack([xi.flatten(), yi.flatten()]))
    
    plt.pcolormesh(xi, yi, zi.reshape(xi.shape), shading='gouraud', cmap=plt.cm.BuGn_r)

    plt.tight_layout()
    plt.show()
    plt.clf()
In [ ]:
data = sample(1000)
plot_density(data)

VAE on a Toy Problem

Recall that when training a VAE we're concerned with the following problem:

$$\min_{\phi} \,\ \mathbb E_{x \sim Pop, z \sim P_\phi(z|x)} \left[ \ln \frac{P_\phi(z|x)}{P(z)} - \ln P_\phi(x|z) \right] \,.$$

We'll model $P_\phi(z|x)$ with an encoder and $P_\phi(x|z)$ with a decoder as follows: $$P_\phi(z|x) = \mathcal N \left(\mu_{\phi,z}(x), \Sigma_{\phi,z}(x) \right)$$ $$P_\phi(x|z) = \mathcal N \left( \mu_{\phi,x}(z), \sigma^2 I \right) \,,$$ where $\mu_{\phi,z}, \Sigma_{\phi,z}, \mu_{\phi,x}$ are neural networks, and $\Sigma_{\phi,z}(x)$ is diagonal.

Moreover, let $P(z)$ (the prior over $z$) be $\mathcal N(0, I)$.

For the above distributions, what is $\ln P_\phi(x|z)$ as a function of $x, z, \mu_{\phi,x}$, and $\sigma$?

------------------------------------------------------------------------------- ANSWER (BEGIN) -------------------------------------------------------------------------------

In general, a multivariate normal distribution over $n$ variables is defined as

$$ P(y) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(y-\mu)^{T}\Sigma^{-1}(y-\mu)\right) $$

In this case we have

\begin{eqnarray} \ln P_\phi(x|z) &=& \ln \left[\frac{1}{(2\pi)^{n/2}\sigma^{n}}\exp\left(-\frac{1}{2\sigma^{2}}(x-\mu_{\phi,x}(z))^{T}(x-\mu_{\phi,x}(z))\right)\right] \\ &=& -\frac{1}{2}\left[n\ln 2\pi + 2n\ln \sigma + \frac{1}{\sigma^{2}}(x-\mu_{\phi,x}(z))^{T}(x-\mu_{\phi,x}(z))\right] \end{eqnarray}

using that $\Sigma = \sigma^{2} I$ implies $|\Sigma|^{1/2} = \sigma^{n}$.

Taking $\sigma = 1$, we are left with

\begin{eqnarray} -\frac{1}{2}[n\ln 2\pi + (x-\mu_{\phi,x}(z))^{T}(x-\mu_{\phi,x}(z))] \end{eqnarray}

According to our objective, the relevant term is $-\ln P_\phi(x|z)$; taking its expectation over $x$ and $z$, we get

\begin{eqnarray} \frac{1}{2}\mathbb E_{x \sim Pop, z \sim P_{\phi}(z|x)}[n\ln 2\pi + (x-\mu_{\phi,x}(z))^{T}(x-\mu_{\phi,x}(z))] \end{eqnarray}
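As an optional numerical sanity check (illustrative values, not part of the assignment), the $\sigma = 1$ log-density above can be compared against PyTorch's built-in Gaussian log-density:

```python
import math
import torch

torch.manual_seed(0)
n = 2                      # dimensionality of x (2 in the toy problem)
x = torch.randn(n)         # an illustrative sample
mu = torch.randn(n)        # stand-in for mu_{phi,x}(z)

# Closed form with sigma = 1: -0.5 * (n ln 2*pi + ||x - mu||^2)
closed = -0.5 * (n * math.log(2 * math.pi) + torch.sum((x - mu) ** 2))

# Reference: sum of per-dimension N(mu_i, 1) log densities
ref = torch.distributions.Normal(mu, 1.0).log_prob(x).sum()
assert torch.allclose(closed, ref)
```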

------------------------------------------------------------------------------- ANSWER (END) -------------------------------------------------------------------------------

For the above distributions, what is $\ln \frac{P_\phi(z|x)}{P(z)}$ as a function of $x, \mu_{\phi,z}, \Sigma_{\phi,z}$?

------------------------------------------------------------------------------- ANSWER (BEGIN) -------------------------------------------------------------------------------

Writing $\mu_{1}, \Sigma_{1}$ for the parameters of $P_\phi(z|x)$ and $\mu_{2}, \Sigma_{2}$ for those of the prior $P(z)$, we have

\begin{eqnarray} \ln \frac{P_\phi(z|x)}{P(z)} &=& \ln P_\phi(z|x) - \ln P(z)\\ &=& \ln \left[\frac{1}{(2\pi)^{n/2}|\Sigma_{1}|^{1/2}}\exp\left(-\frac{1}{2}(z-\mu_{1})^{T}\Sigma_{1}^{-1}(z-\mu_{1})\right)\right] - \ln \left[\frac{1}{(2\pi)^{n/2}|\Sigma_{2}|^{1/2}}\exp\left(-\frac{1}{2}(z-\mu_{2})^{T}\Sigma_{2}^{-1}(z-\mu_{2})\right)\right]\\ &=& \frac{1}{2}\left[\ln \frac{|\Sigma_{2}|}{|\Sigma_{1}|} - (z-\mu_{1})^{T}\Sigma_{1}^{-1}(z-\mu_{1}) + (z-\mu_{2})^{T}\Sigma_{2}^{-1}(z-\mu_{2})\right] \end{eqnarray}

If we now take the expectation of this result over $x$ and $z$

\begin{eqnarray} \DeclareMathOperator{\tr}{tr} \mathbb E_{x \sim Pop, z \sim P_\phi(z|x)} \ln \frac{P_\phi(z|x)}{P(z) } &=& \frac{1}{2} \mathbb E_{x}\mathbb E_{z} \left[\ln \frac{|\Sigma_{2}|}{|\Sigma_{1}|} - (z-\mu_{1})^{T}\Sigma_{1}^{-1}(z-\mu_{1}) + (z-\mu_{2})^{T}\Sigma_{2}^{-1}(z-\mu_{2})\right]\\ &=& \frac{1}{2}\mathbb E_{x}\left[\ln \frac{|\Sigma_{2}|}{|\Sigma_{1}|} + \mathbb E_{z} \left[-(z-\mu_{1})^{T}\Sigma_{1}^{-1}(z-\mu_{1}) + (z-\mu_{2})^{T}\Sigma_{2}^{-1}(z-\mu_{2})\right]\right]\\ &=& \frac{1}{2}\mathbb E_{x}\left[\ln \frac{|\Sigma_{2}|}{|\Sigma_{1}|} + \mathbb E_{z} \left[-\tr(\Sigma_{1}^{-1}(z-\mu_{1})(z-\mu_{1})^{T}) + \tr(\Sigma_{2}^{-1}(z-\mu_{2})(z-\mu_{2})^{T})\right]\right]\\ &=& \frac{1}{2}\mathbb E_{x}\left[\ln \frac{|\Sigma_{2}|}{|\Sigma_{1}|} - d + \tr(\Sigma_{2}^{-1}\Sigma_{1}) + (\mu_{2}-\mu_{1})^{T}\Sigma_{2}^{-1}(\mu_{2}-\mu_{1})\right]\\ \end{eqnarray}

Adapting this general equation to our case, we have

$$ \frac{1}{2}\mathbb E_{x\sim Pop}\left[-\ln |\Sigma_{\phi,z}(x)| - d + \tr(\Sigma_{\phi,z}(x)) + \mu_{\phi,z}(x)^{T}\mu_{\phi,z}(x)\right] $$
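This closed form can be sanity-checked numerically (illustrative values, not part of the assignment) against torch.distributions, which implements the KL divergence between multivariate normals:

```python
import torch
from torch.distributions import MultivariateNormal, kl_divergence

torch.manual_seed(0)
d = 3                                    # latent dimensionality
mu = torch.randn(d)                      # stand-in for mu_{phi,z}(x)
var = torch.rand(d) + 0.1                # diagonal of Sigma_{phi,z}(x), kept positive

# Closed form: 0.5 * (-ln|Sigma| - d + tr(Sigma) + mu^T mu)
closed = 0.5 * (-var.log().sum() - d + var.sum() + mu.dot(mu))

# Reference: KL( N(mu, Sigma) || N(0, I) )
p = MultivariateNormal(mu, covariance_matrix=torch.diag(var))
q = MultivariateNormal(torch.zeros(d), covariance_matrix=torch.eye(d))
assert torch.allclose(closed, kl_divergence(p, q))
```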

------------------------------------------------------------------------------- ANSWER (END) -------------------------------------------------------------------------------

We are almost ready to set up a VAE network in PyTorch and train it. The following cell has an incomplete implementation of a VAE. The encoder and decoder networks are already defined (note that the encoder outputs $\log \Sigma$ instead of $\Sigma$, which is standard practice since otherwise we would have to constrain the network's output to be positive). latent_dim is the dimensionality of the latent variable $z$.

Complete the implementations of encode, sample, and decode. The encode method receives samples $x$ and has to return the mean vector $\mu_z(x)$ and the element-wise log of the diagonal of $\Sigma_z(x)$. The self.encoder network already maps $x$ to a 50-dim vector, and the self.mu, self.logvar modules can be used to map this 50-dim vector to the mean vector and the log diag of the covariance matrix.

The sample method receives mu and logvar (the outputs of encode) and has to return samples from the corresponding Gaussian distribution. Here we typically employ the reparameterization trick, where we can draw a sample $s \sim \mathcal N(\mu, \sigma)$ by doing $s = \mu + \sigma \cdot \epsilon, \epsilon \sim \mathcal N(0, 1)$, which yields well-defined gradients that autograd takes care of computing.
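The trick can be sketched in isolation (a minimal illustration, not part of the assignment) to confirm that gradients indeed flow back into $\mu$ and $\log \sigma^2$:

```python
import torch

torch.manual_seed(0)
mu = torch.zeros(5, requires_grad=True)
logvar = torch.zeros(5, requires_grad=True)

std = torch.exp(0.5 * logvar)
eps = torch.randn(5)        # the only source of randomness; no gradient flows into it
s = mu + std * eps          # a sample from N(mu, sigma^2), differentiable in mu and logvar

s.sum().backward()
# d s_i / d mu_i = 1, so mu.grad is a vector of ones
assert torch.allclose(mu.grad, torch.ones(5))
assert logvar.grad is not None
```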

Finally, the decode method takes $z$ as input and should return $\mu_x(z)$. You should use the self.decoder module for this.

In [ ]:
class VAE(nn.Module):
    def __init__(self, latent_dim):
        super(VAE, self).__init__()
        self.latent_dim = latent_dim

        self.encoder = nn.Sequential(
            nn.Linear(2, 50),
            nn.ReLU(),
            nn.Linear(50, 50),
            nn.ReLU()
        )

        self.mu = nn.Linear(50, latent_dim)
        self.logvar = nn.Linear(50, latent_dim)

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 50),
            nn.ReLU(),
            nn.Linear(50, 50),
            nn.ReLU(),
            nn.Linear(50, 2)
        )

    def encode(self, x):
        encoded = self.encoder(x)
        mu = self.mu(encoded)
        logvar = self.logvar(encoded)
        return mu, logvar
    
    def sample(self, mu, logvar):
        std = torch.exp(0.5*logvar)
        eps = torch.randn_like(std)
        z = mu + (eps * std)
        return z
    
    def decode(self, z):
        out = self.decoder(z)
        return out
        
    def forward(self, x):
        mu, logvar = self.encode(x) #dist params for each latent variable
        z = self.sample(mu, logvar) #sample latent vector from latent dist
        out = self.decode(z) #decode latent vector
        return mu, logvar, out
    
    def generate(self, n):
        z = torch.randn(n, self.latent_dim).cuda()
        samples = self.decode(z)
        return samples

Finally, implement the loss of the VAE by using the equations you derived previously. The recon_loss term should have the factor corresponding to $P(x|z)$, while kld_loss should have the KL divergence term.

In your derivation $\sigma$ hopefully showed up as a weight between the two terms. Here we'll use the standard beta-VAE notation and apply a weight beta on the KL divergence term instead.
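To make the correspondence explicit (a side note, not required for the assignment): dropping the constant $n \ln 2\pi$ from the derived objective and multiplying through by $\sigma^2$ shows that $\beta$ plays the role of $\sigma^2$,

$$\frac{1}{2\sigma^{2}}\,\mathbb E\left[\|x-\mu_{\phi,x}(z)\|^{2}\right] + KL\left(P_\phi(z|x)\,\|\,P(z)\right) \;\propto\; \frac{1}{2}\,\mathbb E\left[\|x-\mu_{\phi,x}(z)\|^{2}\right] + \sigma^{2}\, KL\left(P_\phi(z|x)\,\|\,P(z)\right) \,.$$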

In [ ]:
def loss(x, out, mu, logvar, beta):

    n = x[0].numel()  # data dimensionality (2 for the toy problem)

    # Reconstruction term: 0.5 * (n ln 2*pi + ||x - out||^2), averaged over the batch
    sq_err = ((x - out) ** 2).reshape(x.shape[0], -1).sum(dim=1)
    recons_loss = 0.5 * (n * math.log(2 * math.pi) + sq_err.mean())

    # KL divergence term, using the closed form derived above
    kld_loss = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))

    # Total loss: reconstruction + beta-weighted KL
    loss = recons_loss + beta * kld_loss

    return recons_loss, kld_loss, loss

We can then train the VAE on the toy problem and see how it performs. Try different values of beta until you find one that yields good results.

In [ ]:
vae = VAE(100).cuda()
opt = torch.optim.Adam(vae.parameters(), lr=5e-4)
In [ ]:
beta = 0.01
for i in range(20000):
    s = sample(128).cuda()
    mu, logvar, out = vae(s)
    rl, kl, l = loss(s, out, mu, logvar, beta)
    opt.zero_grad()
    l.backward()
    opt.step()
    if i % 1000 == 0:
        data = vae.generate(5000)
        plot_density(data.cpu().data)

How does beta affect the performance of the VAE? Show or discuss what tradeoff beta controls, and how this can be observed from the above plots and/or any additional plots.

------------------------------------------------------------------------------- ANSWER (BEGIN) -------------------------------------------------------------------------------

The above implementation of the VAE only worked well for values of $\beta$ around 0.01. $\beta$ controls the strength of the KL-divergence term relative to the reconstruction term in the loss. For $\beta \gg 1$, the KL-divergence term dominates and the optimization converges on a model whose posterior $P_\phi(z|x)$ closely matches the assumed prior $P(z)$, at the expense of reconstruction quality. Conversely, for $\beta \ll 1$ we emphasize reconstruction over consistency with the prior. For our toy example, the latter regime gives the best results: a small but nonzero value of $\beta$ works well, presumably because some constraint on the statistics of the latent space is still necessary.

------------------------------------------------------------------------------- ANSWER (END) -------------------------------------------------------------------------------

GAN

Recall the GAN objective $$\min_\psi \max_\phi \,\ \mathbb E_{x \sim Pop}[ -\ln P_\psi(1 | x) ] + \mathbb E_{z \sim \mathcal N(0,1)} [- \ln P_\psi(0|G_\phi(z)) ] \,,$$ where $G_\phi$ is a network that maps Gaussian noise $z \sim \mathcal N(0,1)$ to $G(z)$ with the same shape as $x$, and $P_\psi$ is modeled by another network (the discriminator) that maps real samples $x$ and 'fake' samples $G(z)$ to a distribution over $\{0,1\}$.

We will follow the common practice of adopting a different objective for the generator network $G$: $$\min_\phi \,\ \mathbb E_{z \sim \mathcal N(0,1)} [- \ln P_\psi(1|G_\phi(z)) ] \,.$$
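To see how this objective maps onto code (a sketch; the logits here are made-up values), note that evaluating a binary cross-entropy loss on fake samples with the *real* label computes exactly $-\ln P_\psi(1|G_\phi(z))$:

```python
import torch

criterion = torch.nn.BCEWithLogitsLoss()
logits = torch.tensor([-3.0, 0.0, 3.0])   # hypothetical discriminator logits on fake samples

# BCE against target 1 reduces to -mean(log sigmoid(logit)),
# i.e. the non-saturating generator objective -E[ln P(1 | G(z))]
nonsat = criterion(logits, torch.ones_like(logits))
manual = -torch.log(torch.sigmoid(logits)).mean()
assert torch.allclose(nonsat, manual)
```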

First, complete the implementation of the Generator module below. The forward method takes an integer $n$ as input and should return $n$ samples $G(z), z \sim \mathcal N(0, I)$, each with dimensionality 2. You should use the self.network module for the mapping $G$.

In [ ]:
class Generator(nn.Module):
    def __init__(self, latent_dim):
        super(Generator, self).__init__()
        self.latent_dim = latent_dim

        self.network = nn.Sequential(
            nn.Linear(latent_dim, 50),
            nn.ReLU(),
            nn.Linear(50, 50),
            nn.ReLU(),
            nn.Linear(50, 2)
        )

    def decode(self, input):
        out = self.network(input)
        return out

    def forward(self, n):
        # Draw n latent vectors from N(0, I) and map them through the network
        z = torch.randn(n, self.latent_dim).cuda()
        samples = self.network(z)
        return samples
In [ ]:
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()

        self.network = nn.Sequential(
            nn.Linear(2, 50),
            nn.ReLU(),
            nn.Linear(50, 50),
            nn.ReLU(),
            nn.Linear(50, 1)
        )
    
    def forward(self, input):
        out = self.network(input)
        return out
In [ ]:
generator = Generator(100).cuda() #100 latent dimensions
gopt = torch.optim.Adam(generator.parameters(), lr=5e-4, betas=(0.5, 0.999))
discriminator = Discriminator().cuda()
dopt = torch.optim.Adam(discriminator.parameters(), lr=5e-4, betas=(0.5, 0.999))
criterion = torch.nn.BCEWithLogitsLoss()

Now, you'll implement the training procedure for GANs. In each iteration of the for loop below we'll update the parameters of the generator and then update the discriminator.

Fill in the missing code below. You should rely on the objective given previously to define the losses of the generator and the discriminator (the loss function, the data inputs, and the target labels).

In [ ]:
batch_size = 128
fake_label = 0.
real_label = 1.

for i in range(50000):

    #Train the discriminator
    label = torch.full((batch_size,), real_label, dtype=torch.float).cuda()

    #Train with real batch
    discriminator.zero_grad()

    #Classify real batch, compute loss
    real_data = sample(batch_size).cuda()
    output = discriminator(real_data).view(-1)
    derror_real = criterion(output, label)
    derror_real.backward()

    #Train with fake batch
    fake = generator(batch_size)
    label.fill_(fake_label)

    # Classify fake batch, compute loss
    output = discriminator(fake.detach()).view(-1)
    derror_fake = criterion(output, label)
    derror_fake.backward()
    
    derror = derror_real + derror_fake
    dopt.step()

    #Train the generator 
    generator.zero_grad()
    label.fill_(real_label)

    #Classify fake batch
    output = discriminator(fake).view(-1)
    gerror = criterion(output, label)
    gerror.backward()

    # Update G
    gopt.step()
    
    if i % 1000 == 0:
        data = generator(5000)
        plot_density(data.cpu().data)

Compare and discuss the results you obtained with the VAE and with the GAN approach.

------------------------------------------------------------------------------- ANSWER (BEGIN) -------------------------------------------------------------------------------

The GAN took much longer to train and, even after twice as many iterations, never matched the target distribution as closely as the VAE did. I suspect this could be shown mathematically for the stochastic process we designed: the VAE can model it more directly than the GAN, since the VAE models the data with a multivariate Gaussian and the toy example is exactly a mixture of 2D Gaussians with means in $\{1, 2, 3\}$.

------------------------------------------------------------------------------- ANSWER (END) -------------------------------------------------------------------------------

VAE and GANs on CelebA/MNIST

In this second part of the assignment you'll train a VAE and a GAN on a more interesting dataset. The cell below will try to download and load CelebA, and will just load MNIST in case there is an error.

It is likely that you'll get an error when trying to download CelebA, since its Google Drive is often out of quota. If you'd like to use CelebA anyway, you can try to download it from http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html or some other source. If you're not running this notebook on a GPU, use MNIST instead.

In [ ]:
transform = transforms.Compose([transforms.ToTensor(), transforms.Resize(64), transforms.CenterCrop(64), transforms.Normalize((0.5,), (0.5,))])
try:
    dataset = datasets.CelebA("data", split='all', download=True, transform=transform)
except:
    dataset = datasets.MNIST("data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

We'll use a CNN for the VAE instead of the simple model we defined previously. Implement a network following these specifications:

  • Encoder. Should have 4 conv layers, each with kernel size 4, stride 2 and padding 1, which halve the spatial resolution. The output of the 4th conv (at spatial resolution 4x4) should be flattened, and fully connected layers should then be used to compute mu and logvar. Add whichever activation function you prefer between the conv layers (ReLU, LeakyReLU, ELU, etc.), and feel free to add batch norm as well. Let the first conv layer have, say, 8 or 16 channels, and then double the number of channels at each following conv layer.

  • Decoder. Try to have an architecture that is roughly symmetric to the encoder. For example, start with a fully connected layer to project the latent_dim-dimensional input such that you end up with a $(128 \cdot 4 \cdot 4)$-dimensional vector that you can reshape into a 4x4 image. Then you can apply 4 transposed conv layers, e.g. with kernel size 4, stride 2 and padding 1, to double the spatial resolution with each layer, ending with a final output of size 64x64. Start with around 64 or 128 channels for the first transposed conv and then halve the number of channels at each following layer. As before, add your preferred activation function between layers, with or without batch norm.

The encode, sample, and decode methods have the same specification as before.
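Before writing the full network, the kernel-size-4 / stride-2 / padding-1 arithmetic can be checked in isolation (illustrative channel counts):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
down = nn.Conv2d(8, 16, kernel_size=4, stride=2, padding=1)
up = nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1)

h = down(x)
assert h.shape == (1, 16, 16, 16)   # spatial resolution halved: 32 -> 16
y = up(h)
assert y.shape == (1, 8, 32, 32)    # spatial resolution doubled back: 16 -> 32
```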

In [ ]:
class ConvVAE(nn.Module):
    def __init__(self, latent_dim, hidden_dim=1024):
        super(ConvVAE, self).__init__()

        self.latent_dim = latent_dim
        nc=1; ndf=8; ngf=8

        self.encoder = nn.Sequential(
            
            #Layer 1
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),

            #Layer 2
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),

            #Layer 3
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),

            #Layer 4
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
        )

        self.decoder = nn.Sequential(

            #Layer 1
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            #Layer 2
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            #Layer 3
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),

            #Layer 4 (output): Tanh so outputs span [-1, 1], matching the
            # Normalize((0.5,), (0.5,)) inputs; BatchNorm + ReLU here would
            # clamp outputs to be nonnegative
            nn.ConvTranspose2d(ngf, 1, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

        self.mu = nn.Linear(1024, latent_dim)
        self.logvar = nn.Linear(1024, latent_dim)
        self.hidden = nn.Linear(latent_dim, 1024)

    def sample(self, mu, logvar):
        std = torch.exp(0.5*logvar)
        eps = torch.randn_like(std)
        z = mu + (eps * std)
        return z

    def encode(self, input):
        conv = self.encoder(input)
        conv = conv.view(-1, 1024)
        mu, logvar = self.mu(conv), self.logvar(conv)
        return mu, logvar
      
    def decode(self, z):
        x = self.hidden(z)
        x = x.view(-1,64,4,4)
        out = self.decoder(x)
        return out
            
    def forward(self, input):

        mu, logvar = self.encode(input)
        z = self.sample(mu, logvar)
        decoded = self.decode(z)

        return mu, logvar, decoded
    
    def generate(self, n):
        z = torch.randn(n, self.latent_dim).cuda()
        samples = self.decode(z)
        return samples
In [ ]:
vae = ConvVAE(100).cuda()
opt = torch.optim.Adam(vae.parameters(), lr=5e-4)

The cell below applies a 'patch' in case you're using Google Colab (cv2.imshow doesn't work properly on Google Colab, so cv2_imshow is used in its place). Feel free to comment out the first import if you're not using Google Colab (you'll then have to display the image some other way, e.g. with cv2.imshow or plt.imshow).

In [9]:
from google.colab.patches import cv2_imshow
import cv2

def show(x):
    img = x.data.cpu().permute(1, 2, 0).numpy() * 255
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    cv2_imshow(img)

Again, try to find a value for beta that yields reasonable results.

In [ ]:
beta = 2
for epoch in range(100):
    for i, x in enumerate(loader):
        if len(x) == 2:
            x = x[0]
        x = x.cuda()
        
        mu, logvar, out = vae(x)
        rl, kl, l = loss(x, out, mu, logvar, beta)

        opt.zero_grad()
        l.backward()
        opt.step()

        if i == 0:
            vae.eval()
            data = vae.generate(8)
            grid_img = torchvision.utils.make_grid(data, nrow=8, normalize=True)
            show(grid_img)
            vae.train()

You'll also re-implement the Generator and Discriminator modules for the GAN, adopting a CNN-like architecture.

For the generator, implement a network similar to the one you used for the VAE decoder (fully connected for projection followed by 4 transposed convolutions), while for the discriminator you should use a network similar to the VAE encoder (4 conv layers with stride 2, but note that the output should be a scalar per image, not a latent vector).
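A sketch of the shape bookkeeping for such a discriminator (channel counts are illustrative and activations are omitted for brevity): four stride-2 convolutions take a 64x64 image down to 4x4, and a final 4x4 convolution with no padding collapses it to one scalar per image:

```python
import torch
import torch.nn as nn

d = nn.Sequential(
    nn.Conv2d(1, 16, 4, 2, 1),    # 64 -> 32
    nn.Conv2d(16, 32, 4, 2, 1),   # 32 -> 16
    nn.Conv2d(32, 64, 4, 2, 1),   # 16 -> 8
    nn.Conv2d(64, 128, 4, 2, 1),  # 8 -> 4
    nn.Conv2d(128, 1, 4, 1, 0),   # 4 -> 1: one logit per image
)
out = d(torch.randn(2, 1, 64, 64))
assert out.shape == (2, 1, 1, 1)
```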

In [2]:
#####Configuration cell#######

lr = 0.0002
num_epochs = 5 #number of epochs to train
batch_size = 32
nz = 100 #noise dimension
nc = 1 #number of channels in the training images
ngf = 64 #size of feature maps in generator
ndf = 64 #size of feature maps in discriminator
real_label = 1.
fake_label = 0.
In [3]:
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.main = nn.Sequential(
            #Prep input
            nn.ConvTranspose2d(nz, 8*ngf, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            #Layer 1
            nn.ConvTranspose2d(8*ngf, 4*ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            #Layer 2
            nn.ConvTranspose2d(4*ngf, 2*ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            #Layer 3
            nn.ConvTranspose2d(2*ngf, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            #Layer 4
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh()
        )

    def forward(self, noise):
        #noise = torch.randn(batch_size, nz, 1, 1).cuda()
        return self.main(noise)
In [4]:
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.main = nn.Sequential(
            #Prep input
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            #Layer 1
            nn.Conv2d(ndf, 2*ndf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            #Layer 2
            nn.Conv2d(2*ndf, 4*ndf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            #Layer 3
            nn.Conv2d(4*ndf, 8*ndf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(8*ndf),
            nn.LeakyReLU(0.2, inplace=True),
            #Layer 4
            nn.Conv2d(8*ndf, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)
In [5]:
def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        nn.init.normal_(m.weight.data, 1.0, 0.02)
        nn.init.constant_(m.bias.data, 0)
In [6]:
transform = transforms.Compose([transforms.ToTensor(), transforms.Resize(64), transforms.CenterCrop(64), transforms.Normalize((0.5,), (0.5,))])
try:
    dataset = datasets.CelebA("data", split='all', download=True, transform=transform)
except:
    dataset = datasets.MNIST("data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw
Processing...
Done!
In [7]:
generator = Generator().cuda()
gopt = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
discriminator = Discriminator().cuda()
dopt = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(0.5, 0.999))
criterion = torch.nn.BCELoss()

#Initialize network weights
discriminator.apply(weights_init)
generator.apply(weights_init)
Out[7]:
Generator(
  (main): Sequential(
    (0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)
    (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): ReLU(inplace=True)
    (9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): ReLU(inplace=True)
    (12): ConvTranspose2d(64, 1, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (13): Tanh()
  )
)
In [10]:
for epoch in range(num_epochs):
    for i, data in enumerate(loader):

        #Train the discriminator
        real_data = data[0].cuda()
        b = real_data.size(0)  #the last batch may be smaller than batch_size
        label = torch.full((b,), real_label, dtype=torch.float).cuda()

        #Train with real batch
        discriminator.zero_grad()

        #Classify real batch, compute loss
        output = discriminator(real_data).view(-1)
        derror_real = criterion(output, label)
        derror_real.backward()

        #Train with all-fake batch
        noise = torch.randn(b, nz, 1, 1).cuda()
        fake = generator(noise)
        label.fill_(fake_label)

        # Classify fake batch, compute loss
        output = discriminator(fake.detach()).view(-1)
        derror_fake = criterion(output, label)
        derror_fake.backward()

        derror = derror_real + derror_fake
        dopt.step()

        #Train the generator 
        generator.zero_grad()
        label.fill_(real_label)

        #Classify fake batch
        output = discriminator(fake).view(-1)
        gerror = criterion(output, label)

        # Calculate gradients for G
        gerror.backward()

        # Update G
        gopt.step()
                
        if i == 0:
            grid_img = torchvision.utils.make_grid(fake[:8], nrow=8, normalize=True)
            show(grid_img)

Compare and discuss the results you obtained with the VAE and with the GAN approach for this new dataset. Which of the two approaches was able to generate more realistic samples? Which of the two did you feel that you understood better (there is no correct answer here), and why? Mention one advantage and disadvantage of each of the two methods -- these should be precise properties about each approach, with a focus on what each method can and cannot do. Feel free to check papers, the original GAN paper might be especially helpful.

------------------------------------------------------------------------------- ANSWER (BEGIN) -------------------------------------------------------------------------------

My implementation of the convolutional VAE was not entirely functional in the end (I believe the architecture is very close, but there is a small bug, likely in the decoder's output layer, or perhaps more layers are required in the decoder). However, my implementation of the DCGAN worked quite well after reading through the original paper and using some of the suggested constraints on the initialization of the network parameters.

If the convolutional VAE had worked, I would expect it to perform slightly worse than the GAN, and that $\beta > 1$ would have been optimal. Because of the variability inherent to handwritten digits, constraints on the latent distribution are arguably more important than an exact reconstruction. It would have performed worse than the GAN because the GAN imposes fewer constraints and has more freedom to learn an approximate population distribution of handwritten digits.

An advantage of the GAN is that it doesn't require specifying a prior over the latent space, while the VAE requires exactly that. On the other hand, the performance of the GAN may be sensitive to the initialization of the network parameters (I did not test a variety of initializations).

In addition, the VAE's learned distribution over the latent space can reveal properties of the inputs, which may be valuable in certain applications (protein sequence modelling comes to mind), while the GAN does not expose that information as directly.

------------------------------------------------------------------------------- ANSWER (END) -------------------------------------------------------------------------------
